Starter Information

Package loading

Importing the data

Variables present in the base data set

Data Processing and Summarization

Subsetting the Data

We chose three main buckets of factors with which to analyze how gender affected 2017 income in the National Longitudinal Survey of Youth 1997 (NLSY97):

  1. Demographic information (age, race, physical or emotional condition limiting work, citizenship status, urban/rural residence, number of children in the household under the age of 6, and marital status)

  2. Educational background (the highest educational attainment of mother and father, and the respondent’s highest degree received)

  3. Criminal history (total number of incarcerations and the length of the longest period of incarceration).

Variables Considered But Not Included

The NLSY dataset offers 95 variables for analysis. Prior to selecting a subset of variables, we considered and discussed many others.

Notably, the dataset includes separate variables for educational achievements of biological parents. For this analysis, residential parents were selected to assess the impact of the respondents’ home environment on future income.

The dataset includes a variable that measures whether the respondent is unhappy, sad, or depressed. Instead of using this variable, we chose the variable that measures whether the respondent has a physical or mental condition that limits school or work. condition.limiting.work is a more objective measure of conditions that may impact future earnings: respondents answered either “yes” or “no”, whereas the unhappy, sad, or depressed variable was coded as “not true”, “sometimes true”, or “often true”, which leaves more room for subjective interpretation.

We chose our criminal history variables, num.times.incarc and longest.length.incarc.months, because we wanted objective data points on a respondent’s incarceration history. The larger dataset also includes a variable that captures the respondent’s perception of the likelihood that they would be incarcerated by age 20. The question that generated this variable was asked when the participants were teenagers in 1997. Because answers to the “percent chance of incarceration” question are highly subjective, we thought that num.times.incarc and longest.length.incarc.months would be better suited for our analysis.

When looking for household conditions that could affect the incomes of men and women differently, we thought that the number of children in the household could matter. We initially considered using a variable that measured the number of biological children of each respondent. We ultimately decided to use the num.children.under.6 variable, however, because we were interested in understanding whether the greater childcare needs of younger children affect income differently by sex. Additionally, a respondent’s biological children do not necessarily live in the respondent’s household. Therefore, we chose num.children.under.6 because it provides a better snapshot of household conditions.

Finally, we also considered looking at the financial assets and debt that participants held. We ultimately decided not to include these variables because they would likely be highly correlated with the income variable and would introduce collinearity into our model. Additionally, some of the financial variables might themselves be considered dependent outcomes, which was another reason for their exclusion.

Recoding the Subset of Variables

Recoding Methodology

For the categorical variables, the negative values were recoded based on how they appeared in the codebook, e.g., -3 = “Invalid Skip”, -4 = “Valid Skip”, etc.

For the numerical variables, the negative values that corresponded to “Non-Interview” or “Invalid Skip” were recoded as “NA” because there was no useful interpretation for these values in the larger context of the variable. Some of the negative values corresponding to a “Valid Skip” were instead recoded as “0”. The recoding of each of the 14 variables selected for analysis is discussed in further detail in that variable’s section.
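The numerical recoding rule described above can be sketched in base R as follows. This is a minimal illustration rather than the code used for the report: the helper name `recode_numeric` and the toy values are hypothetical, and the assumption that -4 is the only “Valid Skip” code follows the codes quoted elsewhere in this document.

```r
# Toy data, made up for illustration (not NLSY values).
toy <- data.frame(
  longest.length.incarc.months = c(-4, -3, 0, 12, -5, 30)
)

# Hypothetical helper: map "Valid Skip" (-4) to 0 when that is meaningful,
# then turn every remaining negative survey code into NA.
recode_numeric <- function(x, valid_skip_as_zero = TRUE) {
  if (valid_skip_as_zero) x[!is.na(x) & x == -4] <- 0
  x[!is.na(x) & x < 0] <- NA
  x
}

toy$recoded <- recode_numeric(toy$longest.length.incarc.months)
print(toy)
```

Setting `valid_skip_as_zero = FALSE` would treat a valid skip like the other negative codes, which matches variables where a skip does not imply a value of 0.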

Many variables were recoded as factor variables. This was done in cases where the order of the categories mattered (e.g., residential.dad.highest.grade.bucket) or to select a specific baseline group.
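As a brief sketch of how a factor controls category order and the baseline group, the snippet below uses made-up degree values. The level names mirror labels that appear in this report, but the vector itself is hypothetical.

```r
# Made-up values for illustration only.
degree <- c("Bachelors", "No Degree Earned", "GED", "Bachelors")

# Listing the levels explicitly fixes their order. The first level is the
# baseline that lm() compares every other level against.
degree_f <- factor(degree, levels = c("No Degree Earned", "GED", "Bachelors"))
print(levels(degree_f))

# relevel() switches the baseline without retyping all of the levels.
degree_f2 <- relevel(degree_f, ref = "Bachelors")
print(levels(degree_f2))
```

With “No Degree Earned” as the first level, regression coefficients such as highest.degree.earnedBachelors read as differences from respondents with no degree.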

Processing Income Topcoding

To support later analysis, we also created a subset of the dataset that excludes respondents with topcoded income. Respondents reporting 149,000, the minimum income in the top 2% and the truncation point for topcoding, were kept in this subset because that value is a respondent’s exact income rather than the average assigned to topcoded incomes.

We also created a subset of the data that contains only the respondents with topcoded income. This group contains the “high earners” and there is some independent exploration of this group to understand how they deviate from the other respondents.
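The two subsets described above could be built along these lines. This is a sketch under stated assumptions: `topcode_value` is a hypothetical placeholder for the average assigned to topcoded incomes (the report does not state that number), and the toy incomes are made up.

```r
# Toy data, made up for illustration.
toy <- data.frame(income = c(30000, 149000, 200000, NA, 85000, 200000))

# Hypothetical placeholder for the value assigned to all topcoded incomes.
topcode_value <- 200000

# Flag topcoded rows; NA incomes are never flagged.
is_top <- !is.na(toy$income) & toy$income == topcode_value

no_topcode   <- toy[!is_top, , drop = FALSE]  # exact incomes, incl. 149,000
high_earners <- toy[is_top, , drop = FALSE]   # topcoded "high earners"

print(nrow(no_topcode))
print(nrow(high_earners))
```

Note that rows with missing income stay in `no_topcode` here; whether to drop them is a separate decision.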

Tabular & Graphical Summaries

Main Independent Variable: Sex

Sex Overall Count Percentage of Total Population
Female 4,385 48.81
Male 4,599 51.19

From the summary data, we can see that there is a roughly equal percentage of women and men in the subset of the dataset that will be used as the basis for our analysis. The dataset is 48.8% female and 51.2% male.

Main Dependent (Outcome) Variable: Income

When analyzing the income variable, we started by eliminating the negative values. A value of “-5” corresponds to a “Non-Interview”, meaning that 2,250 participants were not asked this question at all, so it would not make sense to include these respondents in this subset of the dataset.

Additionally, the value of “-1” corresponded to a “Refusal” to answer this question, which again would not have a useful interpretation in the context of the remaining values available for this question.

A value of “-2” corresponded to “Don’t Know”. Respondents who provided this answer were then prompted to answer another question about their estimated income for 2017, but not all participants who were sent to the “estimated income from wages and salary in the past year” question provided a response, which prevented us from including these respondents in our income variable. The two questions are also not measured with the same methodology. The “income” variable records the exact income that the respondent reported, while the follow-up “estimated income” question only provides responses in pre-defined intervals, with no access to the underlying values.

The value of “-4” for the income variable corresponded to a “Valid Skip”, although it is not clear what criteria allowed a participant to skip this question. After exploring the negative values for income further, we decided that the best course of action would be to recode all negative values of the income variable as “NA”.

After completing our recoding, we then did some graphical analysis of the distribution of income in the dataset.

Graphical Summary of Income

Breakdown of Income by Sex
Sex Count $0 (in %) $1-$50,000(in %) $50,000-$100,000(in %) $100,000-$149,000(in %) >$149,000(in %)
Female 4385 0.39 39.04 11.88 1.80 0.68
Male 4599 0.30 29.68 17.70 3.78 1.98

The income distribution between men and women appears to be roughly identical up to around $50,000. After this point, men appear to make up a larger percentage of those earning in the 50,000-100,000, 100,000-149,000, and the greater than 149,000 dollar ranges.

We know that this outcome variable is topcoded for the top 2% of respondents in this dataset. Anyone whose income falls in the top 2%, which has a cutoff of $149,000, is assigned the average income of that top 2% rather than their own value.

To test whether or not the difference in income by sex is statistically significant, we conducted the following t-test.

## 
##  Welch Two Sample t-test
## 
## data:  income by sex
## t = -14.346, df = 4876.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18099.96 -13747.84
## sample estimates:
## mean in group Female   mean in group Male 
##             41278.92             57202.82

From the results of this t-test, it appears that the difference in the mean incomes of men and women is statistically significant, at a significance level of 0.05. The p-value for this test is 9.5091063 × 10^{-46}.
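For readers who want to reproduce this kind of comparison, the sketch below runs a Welch two-sample t-test with `t.test()` on simulated data. The data frame `toy` and its values are made up; only the column names follow this report.

```r
# Simulated data for illustration only (not NLSY values).
set.seed(1)
toy <- data.frame(
  sex = rep(c("Female", "Male"), each = 50),
  income = c(rnorm(50, mean = 40000, sd = 10000),
             rnorm(50, mean = 55000, sd = 15000))
)

# With a formula interface and default arguments, t.test() performs
# Welch's unequal-variance test (var.equal = FALSE).
res <- t.test(income ~ sex, data = toy)
print(res)

# The exact p-value is stored in the result, even when the printout
# only shows "< 2.2e-16".
print(res$p.value)
```

The group means appear in `res$estimate`, and the confidence interval for the difference (first group minus second group) in `res$conf.int`.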

Analyzing the Demographic Distribution of Top Earners

Distribution of High-Earners by Sex
Sex Count
Female 30
Male 91
Distribution of High-Earners by Race and Sex
Race Sex Count
Non-Black/ Non-Hispanic Female 24
Non-Black/ Non-Hispanic Male 72
Black Female 5
Black Male 9
Hispanic Female 1
Hispanic Male 10

Of the 121 “high earners,” only 30 were women. The remaining 91 were men, representing 75.21% of top earners. 96 of the 121 “high earners” were Non-Black/ Non-Hispanic.

Distribution of High-Earners by Highest Degree Earned
Highest Degree Earned Count Proportion
No Degree Earned 3 0.02
GED 2 0.02
High School Diploma 10 0.08
Associate 3 0.02
Bachelors 56 0.46
Masters 13 0.11
Professional Degree (DDS, JD, MD) 25 0.21
Non-Interview 9 0.07

Of the 121 “high earners,” 46.28% had a Bachelors degree, 10.74% had a Masters degree, and 20.66% had a Professional degree (DDS, JD, or MD). Interestingly, none of the 13 respondents with a PhD were in the top income group.

High earners account for only 1.35% of the dataset but they hold a disproportionate number of advanced degrees. High earners hold 3.83% of all Bachelors degrees, 3.81% of all Masters degrees, and 34.25% of all Professional degrees.

Respondents with income in the top 2% are not representative of the respondent group as a whole. It is important to better understand this group and consider its implications for the rest of the analysis. Since high earners tend to be male and more highly educated, some linear regressions should either exclude this group or account for its interactions with sex, race, and highest.degree.earned.

Criminal History Variables

Number of Times Incarcerated

Recoding

Distribution of the Number of Times Participants Were Incarcerated
Possible Values of num.times.incarc Frequency
-3 21
0 8054
1 511
2 205
3 94
4 52
5 29
6 9
7 4
8 3
9 1
11 1

The summary table displayed shows that there is a negative value for the number of incarcerations in the original dataset. The original dataset code of “-3” refers to an invalid skip, which does not provide a useful interpretation in the context of this dataset. Therefore this negative value has been replaced by an “NA” value, which eliminates 21 observations from the dataset.

Tabular Summary of the Data

Summary of Incarceration Statistics by Sex
Sex Overall Count Median Number of Incarcerations Number with Criminal History Proportion with Criminal History
Female 4,385 0 185 0.042
Male 4,599 0 724 0.157

There appears to be a marked difference in the incarceration rates of men and women in this dataset. Particularly, 15.7% of the men in the dataset have experienced incarceration, while 4.2% of the women in the dataset have experienced incarceration.

While there is an overall difference in incarceration rates, the median number of incarcerations is the same for both men and women: 0. So, the majority of the male and female respondents have not been incarcerated.

To analyze whether the difference in the number of incarcerations between men and women was statistically significant, we conducted the following t-test.

## 
##  Welch Two Sample t-test
## 
## data:  num.times.incarc by sex
## t = -16.344, df = 6471.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2613738 -0.2053901
## sample estimates:
## mean in group Female   mean in group Male 
##           0.06843066           0.30181262

The results of the t-test show that there is a statistically significant difference in the number of times that male and female study respondents have been incarcerated. The p-value for this t-test is 7.1378646 × 10^{-59}, which is less than the significance level of 0.05.

Longest Period of Incarceration

Recoding

Two negative values appear in the numerical variable “longest.length.incarc.months”. The first, “-4”, indicates that the respondent validly skipped this question. In the context of the study, this likely means that the 8,054 participants with this value were never incarcerated, so the length of their longest period of incarceration is 0 months. Therefore, we recoded values of -4 as 0.

The second negative value, “-3”, was recoded as “NA” because it corresponds to an “invalid skip” in the original dataset and therefore does not have a useful interpretation relative to the other values of the variable.

Graphical & Tabular Summary of the Data

Summary of Incarceration Length by Sex
Sex Overall Count Maximum Length of Incarcerations (in months) Maximum Length of Incarceration (in years) Average Length of Incarceration (in months)
Female 4,385 101 8.42 0.44
Male 4,599 207 17.25 3.76

The tabular and graphical summaries for the length of incarceration variable show that, not only are men more likely to have a criminal history than women in this dataset, they are also likely to spend more time in jail when they are incarcerated. The average length of time of incarceration for women is 0.44 months, while the average length of time of incarceration for men is 3.76 months.

Demographic Variables

Condition Limiting Work

Recoding

The condition.limiting.work variable was further recoded to change the values for which there was a “Valid Skip” to “No” because, based on the codebook, participants might not have been brought to answer this question if there was no previous indication that they might have some condition that would limit them from working. The other two categories, “Don’t Know” and “Refused to Answer” were recoded as “NA”, as these two categories don’t have a useful interpretation in the context of the rest of the variable.

With these adjustments to the condition.limiting.work variable, here is the adjusted bar graph of the breakdown in the presence of a work-limiting condition by gender:

Distribution of Condition Limiting Work by Sex
Sex Count Has a Work-Limiting Condition Proportion With A Work-Limiting Condition
Female 4,385 254 0.058
Male 4,599 316 0.069

Both the graphical and tabular summaries show that the majority of the respondents do not have a condition limiting work. Additionally, there appears to be a roughly equal proportion of men and women with work-limiting conditions: 5.79% of women and 6.87% of men had a work-limiting condition. This might signal that the effect of this variable on differences in income by gender is not substantial.

Age

Recoding

No recoding was required for the age variable because all respondents to the survey provided a response to the question from which this variable is generated.

Graphical & Tabular Summary

Sex Overall Count Mean Age (in 2017)
Female 4385 34.00
Male 4599 33.98

From this tabular summary, we can see that men and women have approximately the same mean age of 34. The age value used in this analysis is calculated by adding 20 to the participant’s age in 1997. This new age variable was created to align with 2017, the year for which the outcome variable, income, was measured.

Sex Age in 2017 Count Proportion
Female 32 860 0.196
Female 33 873 0.199
Female 34 888 0.203
Female 35 927 0.211
Female 36 837 0.191
Male 32 911 0.198
Male 33 934 0.203
Male 34 953 0.207
Male 35 947 0.206
Male 36 854 0.186

From the bar graphs and the tabular summary, we can observe that the breakdown of age is approximately the same between men and women. Sex does not appear to have an influence on the distribution of age in the dataset.

Race/Ethnicity

To begin exploring the race variable, we created a tabular summary of racial/ethnic background by sex. This tells us whether the sample we are working with is balanced across the sexes on the race variable.

Breakdown of Race by Sex
Sex Race Count Proportion
Female Non-Black/ Non-Hispanic 2,252 0.5135690
Female Black 1,166 0.2659065
Female Hispanic 924 0.2107184
Female Mixed Race (Non Hispanic) 43 0.0098062
Male Non-Black/ Non-Hispanic 2,413 0.5246793
Male Black 1,169 0.2541857
Male Hispanic 977 0.2124375
Male Mixed Race (Non Hispanic) 40 0.0086975

As we can see from the bar chart and tabular summary, the distribution of race/ethnicity is quite similar between men and women.

Now that we have a general idea of how the race variable breaks down by sex, we can explore this variable a little further. To get a general idea of how racial background affects income, we created buckets in which to place respondents. We created a “0 - 5000” bucket for those with very little or no income, and a “5000 - 20000” bucket to get an idea of whose income places them below the poverty line. After that, we used $10,000 ranges up to 100,000 dollars, with everything over that amount set as our top bucket.

Distribution of Income by Race and Sex
Race Sex Count Unknown $0 - $5,000 $5,001 - $20,000 $20,001 - $30,000 $30,001 - $40,000 $40,001 - $50,000 $50,001 - $60,000 $60,001 - $70,000 $70,001 - $80,000 $80,001 - $90,000 $90,001 - $100,000 $100,001+
Non-Black/ Non-Hispanic Female 2252 987 82 217 196 195 168 117 73 65 39 32 81
Non-Black/ Non-Hispanic Male 2413 944 29 113 163 215 205 154 143 105 74 67 201
Black Female 1166 497 52 144 143 123 89 51 29 11 4 6 17
Black Male 1169 584 42 108 108 99 86 33 29 29 15 12 24
Hispanic Female 924 410 32 103 104 93 69 47 20 22 10 4 10
Hispanic Male 977 435 15 62 59 90 93 67 42 32 24 19 39
Mixed Race (Non Hispanic) Female 43 21 0 6 4 2 4 2 1 1 0 1 1
Mixed Race (Non Hispanic) Male 40 15 0 4 3 3 1 5 2 0 2 3 2
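The income buckets used in the table above could be produced with `cut()`. The sketch below is one possible way to do it, not necessarily how the report did it: the break points follow the bucket labels in the table, and the `income` vector is made up.

```r
# Made-up incomes for illustration; NA stands for an unknown income.
income <- c(0, 4500, 5000, 12000, 25000, 101000, NA)

# Break points matching the table's buckets: 0-5,000, 5,001-20,000,
# then $10,000-wide ranges up to 100,000, then everything above.
breaks <- c(0, 5000, 20000, seq(30000, 100000, by = 10000), Inf)

# Intervals are closed on the right, so 5000 lands in the first bucket;
# include.lowest = TRUE keeps an income of exactly 0 in that bucket too.
bucket <- cut(income, breaks = breaks, include.lowest = TRUE, dig.lab = 6)

print(table(bucket, useNA = "ifany"))
```

Cross-tabulating `bucket` against race and sex with `table()` would then give counts in the same layout as the table above, with the `NA` group playing the role of the “Unknown” column.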

We can see that the counts vary considerably, but it’s hard to tell from raw counts whether the income distribution is even across different racial backgrounds. To make variations in the dataset easier to see, we converted all of our counts into proportions of the total for each race-and-sex group.

Next, we calculated the proportions:

Distribution of Income by Race and Sex (As Proportions)
Race Sex Unknown $0 - $5,000 $5,001 - $20,000 $20,001 - $30,000 $30,001 - $40,000 $40,001 - $50,000 $50,001 - $60,000 $60,001 - $70,000 $70,001 - $80,000 $80,001 - $90,000 $90,001 - $100,000 $100,001+
Non-Black/ Non-Hispanic Female 0.44 0.04 0.10 0.09 0.09 0.07 0.05 0.03 0.03 0.02 0.01 0.04
Non-Black/ Non-Hispanic Male 0.39 0.01 0.05 0.07 0.09 0.08 0.06 0.06 0.04 0.03 0.03 0.08
Black Female 0.43 0.04 0.12 0.12 0.11 0.08 0.04 0.02 0.01 0.00 0.01 0.01
Black Male 0.50 0.04 0.09 0.09 0.08 0.07 0.03 0.02 0.02 0.01 0.01 0.02
Hispanic Female 0.44 0.03 0.11 0.11 0.10 0.07 0.05 0.02 0.02 0.01 0.00 0.01
Hispanic Male 0.45 0.02 0.06 0.06 0.09 0.10 0.07 0.04 0.03 0.02 0.02 0.04
Mixed Race (Non Hispanic) Female 0.49 0.00 0.14 0.09 0.05 0.09 0.05 0.02 0.02 0.00 0.02 0.02
Mixed Race (Non Hispanic) Male 0.38 0.00 0.10 0.08 0.08 0.02 0.12 0.05 0.00 0.05 0.08 0.05
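The conversion from counts to proportions can be checked with `prop.table()`. The sketch below uses the two Non-Black/ Non-Hispanic rows from the counts table earlier in this section; the shortened bucket names (such as `b0_5k`) are labels chosen here for brevity.

```r
# Counts for two groups, copied from the counts table in this section.
counts <- rbind(
  "Non-Black/ Non-Hispanic Female" =
    c(987, 82, 217, 196, 195, 168, 117, 73, 65, 39, 32, 81),
  "Non-Black/ Non-Hispanic Male" =
    c(944, 29, 113, 163, 215, 205, 154, 143, 105, 74, 67, 201)
)
colnames(counts) <- c("Unknown", "b0_5k", "b5_20k", "b20_30k", "b30_40k",
                      "b40_50k", "b50_60k", "b60_70k", "b70_80k",
                      "b80_90k", "b90_100k", "b100k_plus")

# margin = 1 divides each count by its row total, i.e. by the size of
# that race-and-sex group.
props <- round(prop.table(counts, margin = 1), 2)
print(props)
```

The row totals are 2,252 and 2,413, matching the Count column, so each row of proportions sums to 1 before rounding.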

From this table, we can tell there are some notable disparities in the distribution of income by race. Non-Black/Non-Hispanic men and women appear in lower proportions than other groups at incomes below $30,000, while Black and Hispanic men and women are more highly concentrated in the lower-income brackets than their Non-Black, Non-Hispanic counterparts. For these reasons, we decided to explore these variables further in our analysis.

Marital Status

In our discussions of potential variables, marital status quickly emerged as an important aspect to consider. For this data, we used the 2011 marital status responses. Since there is no way to determine the marital status of those not interviewed or those who skipped the question, we set those values to NA.

Distribution of Marital Status by Sex
Sex Marital Status Total Count
Female Never-married 1,307
Female Married 1,636
Female Separated 79
Female Divorced 393
Female Widowed 19
Male Never-married 1,459
Male Married 1,430
Male Separated 75
Male Divorced 270
Male Widowed 4

It appears from our plot that more women than men are married and that more women than men are divorced. Based on the error bars, the differences for these two marital statuses appear to be statistically significant.

Citizenship

Distribution of Citizenship by Sex
Sex Citizenship Status Count Proportion of Total
Female U.S. Citizen 3,331 0.37
Female Unknown, not born in the U.S. 132 0.01
Female Unknown, birthplace unknown 398 0.04
Female NA 524 0.06
Male U.S. Citizen 3,538 0.39
Male Unknown, not born in the U.S. 147 0.02
Male Unknown, birthplace unknown 396 0.04
Male NA 518 0.06

The tabular summary shows that 76% of respondents are U.S. citizens. Furthermore, the graphical summary suggests that a roughly equal proportion of men and women are U.S. citizens.

Urban/Rural Residence

Another demographic variable we chose identifies the respondent’s residence type when they were 12. This is categorized as either urban or rural. The setting in which the respondent grew up could limit the resources available to them, thereby impacting their future income potential.

Distribution of Urban/Rural Residence by Sex
Sex Urban/Rural Residence Count Proportion
Female Urban 2,547 0.28
Female Rural 703 0.08
Female Unknown 32 0.00
Female NA 1,103 0.12
Male Urban 2,680 0.30
Male Rural 781 0.09
Male Unknown 43 0.00
Male NA 1,095 0.12

58% of respondents lived in an urban area at the age of 12. There are no patterns in the data in terms of sex - the type of residence at age 12 was evenly divided between men and women.

ANOVA to Test the Association Between Residence Type and Income

##                       Df        Sum Sq    Mean Sq F value Pr(>F)
## urban.status.age.12    1     231326956  231326956   0.135  0.713
## Residuals           3885 6644997487306 1710424064               
## 5097 observations deleted due to missingness
##                       Df        Sum Sq   Mean Sq F value Pr(>F)
## urban.status.age.12    2     419412992 209706496    0.25  0.779
## Residuals           3820 3201037267016 837967871               
## 5040 observations deleted due to missingness

Respondents with “NA” or “Unknown” residence were reassigned as NA for the purposes of statistical analysis. ANOVA results show there is no statistically significant relationship between the type of residence (urban vs. rural) at age 12 and total income later in life. These results hold for subsets of the data both with and without topcoded income values included.
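A one-way ANOVA of this kind can be sketched with `aov()`. The example below uses simulated incomes and a made-up mix of residence categories; it illustrates the reassignment of “Unknown” to NA described above, under the assumption that the variable is stored as a factor.

```r
# Simulated data for illustration only (not NLSY values).
set.seed(2)
toy <- data.frame(
  urban.status.age.12 = factor(rep(c("Urban", "Rural", "Unknown"),
                                   times = c(40, 20, 5))),
  income = rnorm(65, mean = 50000, sd = 20000)
)

# Treat "Unknown" as missing, then drop the now-empty factor level.
toy$urban.status.age.12[toy$urban.status.age.12 == "Unknown"] <- NA
toy$urban.status.age.12 <- droplevels(toy$urban.status.age.12)

# aov() omits rows with missing values and reports how many it deleted.
fit <- aov(income ~ urban.status.age.12, data = toy)
print(summary(fit))
```

With only two remaining levels, the factor has 1 degree of freedom in the ANOVA table; a third level left in the data would show up as 2 degrees of freedom.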

Number of Children under 6 in the Household

Distribution of the Number of Children by Sex
Sex Number of Children Under the Age of 6 Total Count
Female 0 1,955
Female 1 1,150
Female 2 467
Female 3 84
Female 4 10
Female 5 1
Male 0 2,605
Male 1 710
Male 2 352
Male 3 57
Male 4 5

While 0 children under age 6 in the household was the most common category for both men and women in 2011, it looks like many more men than women had 0 children under 6 in the household. To check whether this difference is statistically significant, we ran a t-test.

## 
##  Welch Two Sample t-test
## 
## data:  num.children.under.6 by sex
## t = 12.146, df = 7299.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1835647 0.2542239
## sample estimates:
## mean in group Female   mean in group Male 
##            0.6493046            0.4304103

The results of the t-test show that there is a statistically significant difference in the mean number of children under the age of 6 in the households of men and women. Specifically, women have more children under the age of 6 in their household on average (0.65) than men (0.43).

Findings

Checks for Collinearity

Group 1: Criminal History Variables

Given that both of the variables related to criminal history capture similar information, we conducted a test of collinearity on the longest.length.incarc.months and the num.times.incarc variables.

There does not appear to be any strong collinearity between the incarceration-related variables.
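The report does not state which collinearity check was used. One common, simple option is the pairwise Pearson correlation, sketched below on made-up values; the 0.8 cutoff in the comment is a rule of thumb assumed here, not a threshold taken from this analysis.

```r
# Made-up values for illustration only.
toy <- data.frame(
  num.times.incarc             = c(0, 0, 1, 2, 1, 3, 0, 5),
  longest.length.incarc.months = c(0, 0, 6, 3, 24, 12, 0, 8)
)

# Pearson correlation between the two predictors, ignoring rows with NA.
r <- cor(toy$num.times.incarc,
         toy$longest.length.incarc.months,
         use = "complete.obs")
print(r)

# Assumed rule of thumb: |r| above roughly 0.8 would suggest
# collinearity that is strong enough to worry about.
print(abs(r) > 0.8)
```

A scatterplot matrix via `pairs(toy)` gives a visual version of the same check.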

After testing for collinearity, we also plotted income against the two criminal history variables to get a sense for how they varied with income.

While the incarceration-related variables do not appear to be collinear, there does appear to be some association between each of the variables and the income variable. Particularly, as the number of times someone has been incarcerated increases, their income appears to decline. Additionally, as the length of a participant’s longest period of incarceration increases, their income also appears to decline.

Despite these general trends, there do appear to be a few outliers in the dataset. In particular, there is a respondent whose longest period of incarceration is greater than 50 months (approximately 4 years) but who made more than $200,000 in income in 2017, whereas other participants with the same longest length of incarceration averaged around 50,000 dollars. The effect of these outliers in the incarceration variables is explored further in the analysis that distinguishes between the topcoded and non-topcoded versions of the income outcome variable.

Group 2: Education Variables

We were also interested in seeing if the highest grade completed by the residential mother and father of the participants were in any way collinear, given that these variables capture similar types of information. One could expect that as one residential parent has more education, the other parent is more likely to have a comparable level of education.

While there is an association of 0.66 between the educational levels of parents, there does not appear to be strong collinearity in the parental education variables. As discussed in the analysis of the education variables, the educational achievements of residential parents impact the total income of respondents.

There are notable outliers, however. There are a number of respondents who reported over 100,000 dollars in income while their residential parent(s) completed less than 10 years of school (this corresponds to not completing high school).

Linear Regression Models & Associated Diagnostic Plots

Model 1: Regressing Income on All Variables in Our Subset

Estimate Std. Error t value Pr(>|t|)
(Intercept) -17234 20175 -0.85 0.3931
sexMale 21434 1629 13.16 0.0000
num.times.incarc -3239 1921 -1.69 0.0919
longest.length.incarc.months -158 141 -1.12 0.2636
condition.limiting.workYes -6593 3706 -1.78 0.0754
citizenshipUnknown, not born in the U.S. 10165 4603 2.21 0.0273
citizenshipUnknown, birthplace unknown 14665 3297 4.45 0.0000
urban.status.age.12Rural -3392 1889 -1.80 0.0727
residential.dad.highest.grade.completed 237 341 0.69 0.4875
residential.mom.highest.grade.completed 1313 363 3.62 0.0003
raceBlack -3664 2328 -1.57 0.1156
raceHispanic -1645 2536 -0.65 0.5166
raceMixed Race (Non Hispanic) -14102 9520 -1.48 0.1386
num.children.under.6 -626 1105 -0.57 0.5709
highest.degree.earnedGED 2338 5101 0.46 0.6468
highest.degree.earnedHigh School Diploma 9365 4350 2.15 0.0314
highest.degree.earnedAssociate 17845 5055 3.53 0.0004
highest.degree.earnedBachelors 33999 4586 7.41 0.0000
highest.degree.earnedMasters 36818 5275 6.98 0.0000
highest.degree.earnedPhD 52522 12805 4.10 0.0000
highest.degree.earnedProfessional Degree (DDS, JD, MD) 128980 7990 16.14 0.0000
marital.statusMarried 11246 1838 6.12 0.0000
marital.statusSeparated 1851 7060 0.26 0.7932
marital.statusDivorced 4990 3025 1.65 0.0991
marital.statusWidowed 14510 16965 0.86 0.3925
age.in.2017 423 572 0.74 0.4592

From this initial linear regression model, we can observe that, holding all other variables constant, men make on average 21,434 dollars more than their female counterparts. This coefficient estimate is statistically significant, with a p-value of 3.6433317 × 10^{-38}.

Both of the criminal history-related variables, num.times.incarc and longest.length.incarc.months, appear to be negatively associated with income. As the number of times an individual has been incarcerated increases and as the length of a respondent’s longest period of incarceration increases, their income appears to decline. Neither of these coefficients appears to be statistically significant at a level of 0.05.

Respondents with an unknown birthplace and those who were not born in the U.S. appear to make more money on average, holding all other variables constant, than their counterparts who are U.S. citizens. A critical fact to note, however, is that individuals with an unknown birthplace make up only 794 of the respondents, so this effect could be overstated given the small sample size, relative to the number of respondents in the study who are citizens.

With regard to impact of race on income, Black, Hispanic, and Mixed Race (Non Hispanic) respondents appear to, on average, make less than their Non-Black, Non-Hispanic counterparts. However, this difference in earnings does not appear to be statistically significant.

The education-related variables appear to have the largest number of statistically significant associations with income. As the level of the highest degree that individuals have received increases, their earnings also appear to increase. Study participants with a Professional Degree (DDS, JD, or MD) appear to make, on average, 128,980 dollars more than their counterparts who have not earned a degree. This coefficient estimate is statistically significant, with a p-value that rounds to 0 at the displayed precision.

Finally, individuals who are married appear to make, on average, 11,246 dollars more than their counterparts who are not married. This difference is statistically significant in this linear regression model, with a p-value of 1.1059217 × 10^{-9}.

From the R-squared value for this linear regression model, only about 30% of the variance in the data is explained by the model. To conduct further analysis on model fit, we also ran the plot() function to analyze the diagnostic plots for this model.
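The workflow described here (fit with `lm()`, read off R-squared and exact p-values, then call `plot()` for the diagnostics) can be sketched as follows. The data are simulated and the model includes only two predictors, so this is an illustration of the steps rather than a reproduction of Model 1.

```r
# Simulated data for illustration only (not NLSY values).
set.seed(3)
n <- 100
toy <- data.frame(
  sex = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  age.in.2017 = sample(32:36, n, replace = TRUE)
)
toy$income <- 30000 + 15000 * (toy$sex == "Male") + rnorm(n, sd = 20000)

fit <- lm(income ~ sex + age.in.2017, data = toy)

# Share of the variance in income explained by the model.
r2 <- summary(fit)$r.squared
print(r2)

# Exact p-value for one coefficient, instead of the rounded 0.0000
# that a formatted coefficient table displays.
p_sex <- coef(summary(fit))["sexMale", "Pr(>|t|)"]
print(p_sex)

# The four default diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
par(mfrow = c(2, 2))
plot(fit)
```

Calling `plot(fit, which = 1)` draws a single diagnostic plot when only one is needed.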

Residual vs Fitted Plot

The residual vs. fitted plot shows that, for the lower fitted values, there is some clustering around 0 but as the fitted values increase, there is a greater deviation away from a residual of 0. This could be caused, in part, by the top-coding of the income variable. Furthermore, as we observed from the earlier tabular and graphical summaries, the majority of respondents make around $50,000, therefore there are more observations on which to calculate the residuals in this part of the income spectrum. This isn’t the case, however, with the observations in the top income tier of the dataset.

Normal QQ Plot

The normal QQ plot shows a similar trend to the residual vs. fitted plot. For the majority of the curve, the residuals match the diagonal almost perfectly, demonstrating that these residuals appear to be mostly normally distributed. Beyond a theoretical quantile of about 2, however, the residuals deviate drastically from the diagonal line. Therefore, in the upper tail of the dataset, the assumption of normality does not appear to hold.

Scale-Location Plot

The scale-location plot appears to make a strong case that the assumption of constant variance does not hold for this model. As the fitted values increase, the variance increases steadily. This is also likely to be a side effect of the top-coding of the income variable.

Outliers & The Residuals vs Leverage Plot

The residuals vs leverage plot highlights that there are a few values in the dataset that appear to be outliers, as they lie on or near the Cook’s distance lines. Furthermore, these values have high leverage and high residuals.

Model 2: Addressing the Top-Coding of the Income Variable

The following linear regression is the same as the previous model (referred to as Model #1), where income is regressed on all variables in the subset. The notable exception is that observations with topcoded income are no longer in the subset.
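One way to drop the top-coded observations is sketched below. It assumes that every top-coded respondent shares the maximum recorded income; the data frame `nlsy.example` and its values are hypothetical and stand in for the report's subset.

```r
# Hypothetical data; the two largest incomes share one top-coded value.
nlsy.example <- data.frame(
  income = c(12000, 35000, 50000, 52000, 80000, 200000, 200000),
  sex    = c("Female", "Male", "Female", "Male", "Female", "Male", "Male")
)

# Assumption: the top-code value is the maximum income in the data.
topcode.value <- max(nlsy.example$income, na.rm = TRUE)

# Keep only respondents whose income is strictly below the top-code value.
nlsy.no.topcode <- subset(nlsy.example, income < topcode.value)
nrow(nlsy.no.topcode)
```

The same model formula can then be refit on the reduced data frame, which is what distinguishes Model #2 from Model #1.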

Estimate Std. Error t value Pr(>|t|)
(Intercept) -8198 13740 -0.60 0.5508
sexMale 16267 1108 14.68 0.0000
num.times.incarc -2866 1334 -2.15 0.0318
longest.length.incarc.months -133 92 -1.44 0.1487
condition.limiting.workYes -8840 2544 -3.47 0.0005
condition.limiting.workValid Skip -32868 26134 -1.26 0.2086
condition.limiting.workDon’t Know -42908 26177 -1.64 0.1013
condition.limiting.workRefused to Answer 18055 26267 0.69 0.4919
citizenshipUnknown, not born in the U.S. 5216 3175 1.64 0.1006
citizenshipUnknown, birthplace unknown 5666 2288 2.48 0.0134
urban.status.age.12Rural -2608 1278 -2.04 0.0414
urban.status.age.12Unknown -1555 5629 -0.28 0.7824
residential.dad.highest.grade.completed 133 230 0.58 0.5625
residential.mom.highest.grade.completed 502 248 2.02 0.0431
raceBlack -4167 1571 -2.65 0.0080
raceHispanic 149 1702 0.09 0.9302
raceMixed Race (Non Hispanic) -9654 6395 -1.51 0.1313
num.children.under.6 -1161 756 -1.54 0.1247
highest.degree.earnedGED 5012 3485 1.44 0.1505
highest.degree.earnedHigh School Diploma 11329 2971 3.81 0.0001
highest.degree.earnedAssociate 20901 3453 6.05 0.0000
highest.degree.earnedBachelors 28663 3138 9.13 0.0000
highest.degree.earnedMasters 35334 3627 9.74 0.0000
highest.degree.earnedPhD 59464 8849 6.72 0.0000
highest.degree.earnedProfessional Degree (DDS, JD, MD) 66353 7218 9.19 0.0000
highest.degree.earnedNon-Interview 14490 5671 2.56 0.0107
highest.degree.earnedInvalid Skip 22933 7361 3.12 0.0019
marital.statusMarried 7730 1249 6.19 0.0000
marital.statusSeparated 1276 4601 0.28 0.7815
marital.statusDivorced 3909 2026 1.93 0.0538
marital.statusWidowed 3354 10734 0.31 0.7547
marital.statusInvalid Skip 2705 6840 0.40 0.6925
age.in.2017 546 388 1.41 0.1595

From this linear regression model, we can observe that, holding all other variables constant, men make 16,267 dollars more than their female counterparts. This coefficient estimate is statistically significant, with a p-value of 9.0663611 × 10^{-47}.

The results of this model are very similar to Model #1, with some exceptions. The num.times.incarc variable is still negatively associated with income. In this model, unlike in Model #1 however, this relationship is statistically significant. There are some outliers in the comparison of income and criminal history (those with a history of incarceration and very high income). With the removal of these outliers, the model perhaps is better capturing the impact of criminal history on income.

With regard to the impact of race on income, Black and Mixed Race (Non Hispanic) respondents appear to, on average, make less than their Non-Black, Non-Hispanic counterparts, while the estimate for Hispanic respondents is small and positive (149 dollars). In this model, the difference in earnings is statistically significant only for Black respondents.

From the R-squared value for this linear regression model, only about 25% of the variance in the data is explained by the model. This R-squared is slightly lower than that of Model #1. To conduct further analysis on model fit, we also ran the plot() function to analyze the diagnostic plots for this model.

Residual vs Fitted Plot

The residual vs. fitted plot shows that, for the lower fitted values, there is some clustering around 0, but as the fitted values increase, there is a greater deviation away from a residual of 0. This plot suggests that the model is still overestimating income in both the lower and the higher income brackets. The deviation in the upper tail of the dataset, however, is not as drastic in Model #2 as it was in Model #1.

Normal QQ Plot

The normal QQ plot shows a similar trend to the normal QQ plot for the first model. Although there are still large deviations in the upper tail, the residuals in this model appear closer to normal than in Model #1.

Scale-Location Plot

As the fitted values increase, the variance still increases steadily. The slope of the standardized residuals, although not constant, is smaller than the slope in Model #1.

Outliers & The Residuals vs Leverage Plot

The residuals vs leverage plot shows that the outliers in the first model persist even with the removal of topcoded income.

Overall, it appears that removing the topcoded income values has addressed some of the non-constant variance from Model #1, but not all of it.

Models 3 & 4: Distilling the Impact of Gender & Criminal History on Income

Adding An Interaction Term between sex and num.times.incarc

From the tabular and graphical summaries conducted earlier, it was evident that there was a difference in the incarceration rate of respondents in the study based on their gender. Particularly, more men than women in the study had been incarcerated and, among the population that had been incarcerated, men tended to serve longer sentences than women. With this information, we decided to take a look at the gender differences in income based on criminal history.

Estimate Std. Error t value Pr(>|t|)
(Intercept) -8230 13743 -0.60 0.5493
sexMale 16305 1116 14.61 0.0000
num.times.incarc -1663 4367 -0.38 0.7034
longest.length.incarc.months -133 92 -1.44 0.1511
condition.limiting.workYes -8851 2545 -3.48 0.0005
condition.limiting.workValid Skip -32837 26139 -1.26 0.2092
condition.limiting.workDon’t Know -42909 26182 -1.64 0.1014
condition.limiting.workRefused to Answer 18083 26273 0.69 0.4914
citizenshipUnknown, not born in the U.S. 5230 3176 1.65 0.0998
citizenshipUnknown, birthplace unknown 5665 2289 2.47 0.0134
urban.status.age.12Rural -2615 1278 -2.05 0.0409
urban.status.age.12Unknown -1545 5630 -0.27 0.7838
residential.dad.highest.grade.completed 133 231 0.58 0.5651
residential.mom.highest.grade.completed 503 248 2.03 0.0427
raceBlack -4157 1571 -2.65 0.0082
raceHispanic 148 1703 0.09 0.9307
raceMixed Race (Non Hispanic) -9654 6396 -1.51 0.1313
num.children.under.6 -1153 757 -1.52 0.1276
highest.degree.earnedGED 5009 3486 1.44 0.1509
highest.degree.earnedHigh School Diploma 11314 2972 3.81 0.0001
highest.degree.earnedAssociate 20896 3453 6.05 0.0000
highest.degree.earnedBachelors 28654 3139 9.13 0.0000
highest.degree.earnedMasters 35323 3627 9.74 0.0000
highest.degree.earnedPhD 59462 8850 6.72 0.0000
highest.degree.earnedProfessional Degree (DDS, JD, MD) 66356 7220 9.19 0.0000
highest.degree.earnedNon-Interview 14541 5675 2.56 0.0105
highest.degree.earnedInvalid Skip 22922 7363 3.11 0.0019
marital.statusMarried 7734 1249 6.19 0.0000
marital.statusSeparated 1285 4602 0.28 0.7801
marital.statusDivorced 3896 2026 1.92 0.0546
marital.statusWidowed 3372 10737 0.31 0.7535
marital.statusInvalid Skip 2728 6841 0.40 0.6902
age.in.2017 547 388 1.41 0.1593
sexMale:num.times.incarc -1295 4476 -0.29 0.7724

From the results of adding this interaction term, the estimated impact of incarceration on income differs by the study participants’ gender. Specifically, we see that for women, each additional incarceration is associated with a 1,663 dollar decline in their income. For men, however, each additional incarceration is associated with a 2,958 dollar decline in their income (the sum of the num.times.incarc coefficient and the interaction coefficient). Although these findings are in line with our tabular and graphical summaries for num.times.incarc, neither one of these findings is statistically significant.
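The arithmetic behind the two slopes can be sketched as follows. The coefficients are copied from the regression table above; the variable names are chosen here for illustration only.

```r
# Coefficients copied from the regression table above.
b.incarc      <- -1663  # num.times.incarc: slope for women (reference group)
b.male.incarc <- -1295  # sexMale:num.times.incarc: additional slope for men

# Change in income per additional incarceration, by sex.
slope.women <- b.incarc
slope.men   <- b.incarc + b.male.incarc

c(women = slope.women, men = slope.men)
```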

In this model, we still observe that the sex, citizenship, highest grade completed by mom, highest degree earned, and married marital status variables have statistically significant associations with income. The highest grade completed by dad is not statistically significant, with a p-value of 0.5651.

Testing the Significance of Adding the Sex * Number of Times Incarcerated Term

Now that the sex * num.times.incarc interaction variable has been added to the regression, we also decided to test whether this addition was statistically significant using an ANOVA test.

## Analysis of Variance Table
## 
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017 + sex:num.times.incarc
##   Res.Df           RSS Df Sum of Sq      F Pr(>F)
## 1   2346 1595000619595                           
## 2   2345 1594943701328  1  56918266 0.0837 0.7724

From this analysis, we can see that because the p-value for this test is 0.7723895, we would fail to reject the null hypothesis that the income gap between men and women does not depend on the number of times that they have been incarcerated. Put simply, the data do not provide evidence that the income gap between men and women varies with the number of times an individual has been incarcerated.
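As a check on the table above, the F statistic and p-value for a nested-model comparison can be recomputed by hand from the residual sums of squares and degrees of freedom. The sketch below copies those four numbers from the printed output; the variable names are illustrative.

```r
# Values copied from the Analysis of Variance Table above.
rss.reduced <- 1595000619595  # model without the interaction term
df.reduced  <- 2346
rss.full    <- 1594943701328  # model with sex:num.times.incarc
df.full     <- 2345

# F = (drop in RSS per added parameter) / (residual variance of the full model).
df.diff <- df.reduced - df.full
f.stat  <- ((rss.reduced - rss.full) / df.diff) / (rss.full / df.full)

# Upper-tail probability of the F distribution gives the p-value.
p.value <- pf(f.stat, df.diff, df.full, lower.tail = FALSE)

round(c(F = f.stat, p = p.value), 4)
```

With a single added term, this F statistic is the square of the t value reported for the interaction coefficient, which is why both tests give the same p-value.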

Adding An Interaction Term of Sex * Longest.Length.Incarc.Months

We also saw from the initial tabular and graphical summaries that men, on average, served longer sentences when they were incarcerated than women. Therefore, we also wanted to add an interaction term between sex and longest.length.incarc.months to see if there were any differences in income by sex based on the longest length of time that an individual had been incarcerated.

Estimate Std. Error t value Pr(>|t|)
(Intercept) -8564 14115 -0.61 0.5441
sexMale 16599 3034 5.47 0.0000
num.times.incarc -1210 6160 -0.20 0.8443
longest.length.incarc.months -205 702 -0.29 0.7702
condition.limiting.workYes -8854 2546 -3.48 0.0005
condition.limiting.workValid Skip -32831 26145 -1.26 0.2093
condition.limiting.workDon’t Know -42907 26188 -1.64 0.1015
condition.limiting.workRefused to Answer 18078 26278 0.69 0.4915
citizenshipUnknown, not born in the U.S. 5231 3177 1.65 0.0998
citizenshipUnknown, birthplace unknown 5667 2289 2.48 0.0134
urban.status.age.12Rural -2611 1279 -2.04 0.0414
urban.status.age.12Unknown -1542 5631 -0.27 0.7843
residential.dad.highest.grade.completed 133 231 0.58 0.5646
residential.mom.highest.grade.completed 503 248 2.03 0.0426
raceBlack -4161 1572 -2.65 0.0082
raceHispanic 147 1703 0.09 0.9312
raceMixed Race (Non Hispanic) -9656 6398 -1.51 0.1313
num.children.under.6 -1153 757 -1.52 0.1278
highest.degree.earnedGED 5027 3491 1.44 0.1500
highest.degree.earnedHigh School Diploma 11312 2973 3.81 0.0001
highest.degree.earnedAssociate 20897 3454 6.05 0.0000
highest.degree.earnedBachelors 28653 3139 9.13 0.0000
highest.degree.earnedMasters 35321 3628 9.74 0.0000
highest.degree.earnedPhD 59460 8852 6.72 0.0000
highest.degree.earnedProfessional Degree (DDS, JD, MD) 66354 7221 9.19 0.0000
highest.degree.earnedNon-Interview 14541 5676 2.56 0.0105
highest.degree.earnedInvalid Skip 22923 7364 3.11 0.0019
marital.statusMarried 7734 1249 6.19 0.0000
marital.statusSeparated 1283 4603 0.28 0.7804
marital.statusDivorced 3903 2028 1.92 0.0544
marital.statusWidowed 3373 10739 0.31 0.7535
marital.statusInvalid Skip 2727 6843 0.40 0.6902
age.in.2017 548 389 1.41 0.1587
sexMale:num.times.incarc -1759 6314 -0.28 0.7806
sexMale:longest.length.incarc.months 74 708 0.10 0.9170

The results from this regression model indicate that for each additional month that a woman serves as part of a criminal sentence, her income decreases by 205 dollars. For men, the corresponding decrease is 131 dollars per month. This suggests that longest.length.incarc.months depresses the income of women to a slightly higher degree than it does for men. However, neither the interaction term with longest.length.incarc.months nor the variable itself is statistically significant in this model.

## Analysis of Variance Table
## 
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017 + sex:num.times.incarc
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017 + sex:num.times.incarc + sex:longest.length.incarc.months
##   Res.Df           RSS Df Sum of Sq      F Pr(>F)
## 1   2345 1594943701328                           
## 2   2344 1594936307897  1   7393431 0.0109  0.917

An ANOVA test also shows that the addition of this interaction term did not have a statistically significant impact on explaining income gaps observed in the model.

Model 5: Distilling the Impact of Gender & Race on Income

Is Race Statistically Significant?

Based on the tabular summaries and graphs we saw above, race seems to have a considerable effect on income, but we cannot be certain of this effect until we run an ANOVA test.

## Analysis of Variance Table
## 
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017
##   Res.Df           RSS Df  Sum of Sq      F Pr(>F)
## 1   2259 3204242937410                            
## 2   2256 3197889077219  3 6353860191 1.4941 0.2142

When we run the ANOVA test we see the p-value of 0.2141735 is large enough that race cannot be assessed to be a statistically significant predictor of income when income is topcoded. But that doesn’t tell the full story.
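A test of this kind compares a model that omits race against an otherwise identical model that includes it. The sketch below shows the general pattern on simulated data; `example.data`, `full.lm`, and `reduced.lm` are hypothetical names, and the simulated income does not depend on race by construction.

```r
# Simulated stand-in data (an assumption, not the NLSY subset).
set.seed(7)
n <- 300
race.levels <- c("Black", "Hispanic", "Mixed Race", "Non-Black/Non-Hispanic")
example.data <- data.frame(
  sex  = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  race = factor(sample(race.levels, n, replace = TRUE))
)
example.data$income <- 40000 + 15000 * (example.data$sex == "Male") +
  rnorm(n, sd = 20000)

# Full model with race, and the same model with race removed.
full.lm    <- lm(income ~ sex + race, data = example.data)
reduced.lm <- update(full.lm, . ~ . - race)

# F test of the null hypothesis that all race coefficients are zero.
race.test <- anova(reduced.lm, full.lm)
race.p    <- race.test[["Pr(>F)"]][2]
race.test
```

Running the same comparison on the data with and without the top-coded observations is what allows the two p-values for race to be compared.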

Viewing the Effects of Top-Coding on the Interaction between Race and Income

Earlier, we created a second linear model that excluded the observations with top-coded income, but how much was that top-coding interacting with our race variable? We ran an ANOVA to test whether race is still not statistically significant once the top-coded observations are removed.

## Analysis of Variance Table
## 
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017
##   Res.Df           RSS Df  Sum of Sq      F Pr(>F)
## 1   2259 3204242937410                            
## 2   2256 3197889077219  3 6353860191 1.4941 0.2142

Now when we run the ANOVA test on the model without the top-coded observations, we see that the p-value of 0.0233493 is sufficiently small for us to say that race is a statistically significant predictor of income, even when taken with the rest of our model. This shows that top-coding can change which variables appear to matter in the model.

We see clues of this when we look at our second model, which reviewed the effects of top-coding income. In our topcoded data set the coefficient for Black is not statistically significant, but when we remove the top-coded observations this is no longer true: the coefficient for Black meets the threshold for statistical significance at a significance level of 0.05, with a p-value of 0.0080344. The same holds for other coefficients in the dataset that depress income, which we will expand on in our analysis.

For this reason, we’ll use our linear model without top-coding to test how race affects income.

Adding an interaction term between sex and race

The plots and tabular summaries above indicated that race had a significant impact on income, and that variance was not evenly distributed between the sexes. It is clear that Black men and women make less than Non-Black/Non Hispanic men and women. To explore this further we decided to dig into gender differences in income based on racial background.

Estimate Std. Error t value Pr(>|t|)
(Intercept) -9817 13769 -0.71 0.4759
sexMale 17486 1375 12.72 0.0000
num.times.incarc -2907 1333 -2.18 0.0293
longest.length.incarc.months -115 93 -1.24 0.2143
condition.limiting.workYes -8994 2543 -3.54 0.0004
condition.limiting.workValid Skip -32231 26119 -1.23 0.2173
condition.limiting.workDon’t Know -43268 26177 -1.65 0.0985
condition.limiting.workRefused to Answer 15288 26274 0.58 0.5607
citizenshipUnknown, not born in the U.S. 5212 3174 1.64 0.1006
citizenshipUnknown, birthplace unknown 5811 2289 2.54 0.0112
urban.status.age.12Rural -2400 1282 -1.87 0.0613
urban.status.age.12Unknown -1524 5625 -0.27 0.7865
residential.dad.highest.grade.completed 121 230 0.52 0.6007
residential.mom.highest.grade.completed 509 248 2.05 0.0402
raceBlack -708 2160 -0.33 0.7433
raceHispanic 253 2258 0.11 0.9109
raceMixed Race (Non Hispanic) -1134 11725 -0.10 0.9230
num.children.under.6 -1308 758 -1.73 0.0844
highest.degree.earnedGED 4741 3486 1.36 0.1739
highest.degree.earnedHigh School Diploma 11194 2970 3.77 0.0002
highest.degree.earnedAssociate 20781 3451 6.02 0.0000
highest.degree.earnedBachelors 28508 3137 9.09 0.0000
highest.degree.earnedMasters 35162 3626 9.70 0.0000
highest.degree.earnedPhD 59267 8845 6.70 0.0000
highest.degree.earnedProfessional Degree (DDS, JD, MD) 66352 7217 9.19 0.0000
highest.degree.earnedNon-Interview 13728 5677 2.42 0.0157
highest.degree.earnedInvalid Skip 22896 7356 3.11 0.0019
marital.statusMarried 7981 1252 6.38 0.0000
marital.statusSeparated 1355 4598 0.29 0.7682
marital.statusDivorced 4207 2028 2.07 0.0381
marital.statusWidowed 4515 10739 0.42 0.6742
marital.statusInvalid Skip 2651 6835 0.39 0.6982
age.in.2017 578 388 1.49 0.1370
sexMale:raceBlack -7177 3064 -2.34 0.0192
sexMale:raceHispanic -144 2768 -0.05 0.9584
sexMale:raceMixed Race (Non Hispanic) -12211 13960 -0.87 0.3818

The summary of this regression model with the addition of an interaction term between race and sex offers some very interesting results that ripple across all aspects of our analysis.

How Large is the Income Gap Between Different Racial Groups?

Adding the sex-race interaction offers some interesting insights into how race and sex affect income, and how income gaps vary across different racial groups.

Mixed Race, Non-Hispanic men make 13,345 dollars less than their Non-Black, Non-Hispanic counterparts. This is the largest income differential that we observed by analyzing the income gap by sex and race.

The biggest gap in income between women based on race also exists between Mixed Race, Non-Hispanic women and Non-Black, Non-Hispanic women. Mixed Race, Non-Hispanic women make 1,134 dollars less than their Non-Black, Non-Hispanic counterparts.

Neither of these findings, however, is statistically significant at a significance level of 0.05.

How Large is the Income Gender Gap by Racial Group?

The income gap between men and women is similar in size for Hispanic respondents, with a gap of 17,342 dollars, and for Non-Black/Non-Hispanic respondents, with a gap of 17,486 dollars. These are the largest income gaps, but the interaction coefficient for Hispanic men is not statistically significant.

The income gap shrinks considerably when looking at Black men and women, with a gap of 10,309 dollars.
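The gender gap for each racial group is the sexMale coefficient plus the matching sex-by-race interaction coefficient. The sketch below reproduces that arithmetic with coefficients copied from the regression table above; the variable names are illustrative.

```r
# Coefficients copied from the sex * race regression table above.
b.male <- 17486  # sexMale: gender gap for the reference racial group

# sexMale:race interaction terms (0 for the reference group).
b.male.by.race <- c(
  "Non-Black/Non-Hispanic"    = 0,
  "Black"                     = -7177,
  "Hispanic"                  = -144,
  "Mixed Race (Non Hispanic)" = -12211
)

# Estimated male-minus-female income gap within each racial group.
gender.gap <- b.male + b.male.by.race
gender.gap
```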

Other Effects

Another effect of adding an interaction variable between sex and race is that being divorced, which was on the edge of statistical significance in the original model with a p-value of 0.0537522, would now be considered statistically significant, with a p-value of 0.038099, when our sex * race interaction variable is added. This makes sense as it follows the general trend showing how a few incomes at the top levels were skewing the results across multiple other variables when they were suppressed by a topcoded value.

Residual vs Fitted Plot

The residual vs. fitted plot continues to show that, for the lower fitted values, there is clustering around 0 and, as the fitted values increase, deviation away from a residual of 0 increases. This suggests that the model is still overestimating income in lower income brackets, very slightly underestimating income in the middle brackets, and overestimating income in higher income brackets. The deviation in the upper tail is mostly similar to Model #2 but corrects slightly for the overestimation of income in the higher brackets.

Normal QQ Plot

The normal QQ plot shows a very similar trend to the normal QQ plot for Model #2, as we might expect from looking at the residual vs. fitted plot. While there are still large deviations in the upper tail, this model corrects very slightly for incomes in higher brackets.

Scale-Location Plot

As the fitted values increase, the variance still increases steadily, holding nearly identically with Model #2 but with some slight variation at the very top and bottom of the range.

Outliers & The Residuals vs Leverage Plot

The residuals vs leverage plot shows that the outliers in Model #2 remain present and that some additional outliers appear as income rises.

Conclusions from the Plot

Adding the sex*race interaction term appears to change very little that removing top coding from income didn’t already change. While it addresses some minor issues with the model, it also appears to introduce new outliers, making it appear potentially redundant.

How does adding the sex/race interaction variable change predictions for the Income Gender Gap by Race?

Now that we’ve reviewed the analysis of the gender income gap we wanted to see exactly how large this gender based income gap was between different racial groups in the main effects model versus our model adding a sex*race interaction term.

race income.gap
Non-Black/ Non-Hispanic 13177
Black 5356
Hispanic 13563
Mixed Race (Non Hispanic) 14836

It appears that the main effects model is effective in showing general trends but misses the mark when predicting values for the size of the income gap by sex. At 10,309 dollars, the income gap between Black men and women is nearly double what is predicted by the main effects model. The main effects model does capture the similarity between the gap for Hispanic men and women and the gap for Non-Black/Non-Hispanic men and women, but it underestimates the size of each gap by about 4,000 dollars.

The main effects model is most inaccurate when considering Mixed Race, Non-Hispanic respondents. It predicts a 14,836 dollar gap, which is far off the 5,275 dollar gap shown by the interaction model. We suspect this is due to the small sample size of Mixed Race, Non-Hispanic respondents as compared to other groups.
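The comparison between the two sets of gender gaps can be laid out side by side. In the sketch below, the main effects values are copied from the table above and the interaction model values are the gaps implied by the sex * race regression coefficients; the object names are illustrative.

```r
gap.comparison <- data.frame(
  race = c("Non-Black/Non-Hispanic", "Black", "Hispanic",
           "Mixed Race (Non Hispanic)"),
  main.effects.gap = c(13177, 5356, 13563, 14836),  # from the table above
  interaction.gap  = c(17486, 10309, 17342, 5275)   # sexMale + sexMale:race
)

# Positive values mean the main effects model understates the gap.
gap.comparison$difference <-
  gap.comparison$interaction.gap - gap.comparison$main.effects.gap

# Ratio of the interaction model gap to the main effects gap.
gap.comparison$ratio <-
  round(gap.comparison$interaction.gap / gap.comparison$main.effects.gap, 2)

gap.comparison
```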

Testing the Significance of Adding an Interaction Variable on the Sex * Race Term

In order to determine how we should proceed with the insights gleaned from adding a sex * race interaction variable, we tested the statistical significance of our findings with an ANOVA test.

## Analysis of Variance Table
## 
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017 + sex:race
##   Res.Df           RSS Df  Sum of Sq      F  Pr(>F)  
## 1   2346 1595000619595                               
## 2   2343 1590698444288  3 4302175307 2.1123 0.09663 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Reviewing our results, we see that while the p-value of 0.096629 would be considered significant at the 0.10 level, it falls short of the 0.05 threshold required to reject the null hypothesis. Therefore, we fail to reject the null hypothesis that the gender income gap is the same across all racial categories at a significance level of 0.05.

Model 6: Distilling the Impact of Marital Status by Sex on Income

Is the Effect of Marital Status Statistically Significant?

In Model #1 and Model #2 we see that being married has a statistically significant effect on income. Notably, though, in Model #1 being divorced does not have a significant effect, and divorce falls just slightly outside the range of statistical significance in Model #2. In Model #5, which adds the sex * race interaction, divorce becomes statistically significant. To see if marital status is a statistically significant variable in our model, we ran an ANOVA test.

## Analysis of Variance Table
## 
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + highest.degree.earned + age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017
##   Res.Df           RSS Df   Sum of Sq      F       Pr(>F)    
## 1   2351 1621581006295                                       
## 2   2346 1595000619595  5 26580386700 7.8191 0.0000002608 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When we run the ANOVA test we see the p-value of 0.0000003 is sufficiently small for us to say that marital status is a statistically significant predictor of income, and is influencing our model.

Adding an interaction term between sex and marital status

Our tabular and graphical summaries above illustrated clearly that marital status is not uniform between men and women, with more women being married and divorced. The ANOVA test above additionally tells us that marital status has a statistically significant effect on income. With these facts in mind, we created an interaction variable for sex * marital.status to see how marital status interacts with sex to influence our model.

Estimate Std. Error t value Pr(>|t|)
(Intercept) -4288 13663 -0.31 0.7536
sexMale 7580 1884 4.02 0.0001
num.times.incarc -2568 1327 -1.94 0.0530
longest.length.incarc.months -117 92 -1.28 0.2002
condition.limiting.workYes -8927 2528 -3.53 0.0004
condition.limiting.workValid Skip -30238 25942 -1.17 0.2439
condition.limiting.workDon’t Know -39026 25989 -1.50 0.1333
condition.limiting.workRefused to Answer 19920 26119 0.76 0.4457
citizenshipUnknown, not born in the U.S. 5288 3152 1.68 0.0936
citizenshipUnknown, birthplace unknown 6019 2275 2.65 0.0082
urban.status.age.12Rural -2386 1269 -1.88 0.0601
urban.status.age.12Unknown -1231 5591 -0.22 0.8257
residential.dad.highest.grade.completed 92 229 0.40 0.6877
residential.mom.highest.grade.completed 589 247 2.39 0.0169
raceBlack -5132 1567 -3.27 0.0011
raceHispanic 242 1691 0.14 0.8863
raceMixed Race (Non Hispanic) -9914 6351 -1.56 0.1186
num.children.under.6 -1277 752 -1.70 0.0894
highest.degree.earnedGED 5418 3462 1.57 0.1177
highest.degree.earnedHigh School Diploma 11556 2949 3.92 0.0001
highest.degree.earnedAssociate 21354 3429 6.23 0.0000
highest.degree.earnedBachelors 28629 3116 9.19 0.0000
highest.degree.earnedMasters 35075 3602 9.74 0.0000
highest.degree.earnedPhD 59911 8785 6.82 0.0000
highest.degree.earnedProfessional Degree (DDS, JD, MD) 66555 7166 9.29 0.0000
highest.degree.earnedNon-Interview 14157 5634 2.51 0.0121
highest.degree.earnedInvalid Skip 23532 7307 3.22 0.0013
marital.statusMarried 296 1773 0.17 0.8673
marital.statusSeparated 3119 6634 0.47 0.6383
marital.statusDivorced -1142 2762 -0.41 0.6793
marital.statusWidowed -10762 11714 -0.92 0.3583
marital.statusInvalid Skip -11996 11683 -1.03 0.3046
age.in.2017 553 386 1.44 0.1513
sexMale:marital.statusMarried 13727 2354 5.83 0.0000
sexMale:marital.statusSeparated -3736 9123 -0.41 0.6822
sexMale:marital.statusDivorced 9161 3990 2.30 0.0218
sexMale:marital.statusWidowed 64308 28500 2.26 0.0241
sexMale:marital.statusInvalid Skip 23355 14314 1.63 0.1029

How Large is the Income Gap Between Marital Statuses?

From the summary we see a number of very interesting results.

We see that being married has a strong positive association with income for men. Combining the main effects with the interaction terms, married men make 6,004 dollars more on average than divorced men (14,023 versus 8,019 dollars relative to never-married men) and 14,640 dollars more than separated men.

This positive correlation with income is also true for married women who on average make 1,438 dollars more than their counterparts who are divorced.

However being separated has far more positive correlation with income for women than men, with separated women making on average 2,823 dollars more than married women.

How Large is the Income Gender Gap between Different Marital Statuses?

We also see some interesting results regarding the gender differential for similar marital statuses. First, and most strikingly, we see a 71,888 dollar income gap between men who have been widowed and women who have been widowed. The interaction term behind this result is statistically significant at the 0.05 level, with a p-value of 0.0241, suggesting that losing a spouse has a far larger impact on women’s income than on men’s.

It is important to note that the sample size for women and men who were widowed is very small. In fact, only 4 men and 19 women in the entire dataset have been widowed. These small sample sizes likely explain at least some of the high irregularity of the result regarding widows compared to other marital status groups.

The widowed gap is a fair bit larger than the 21,307 dollar difference between married men and women, the marital status that appears most frequently in our model, and that difference is also statistically significant.

Both of these gaps, however, far exceed the 3,844 dollar income gap between men and women who are classified as separated. While the income gap persists across all marital statuses, the relative size of that income gap is heavily dependent on the specifics of the marital status.
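The gender gaps quoted in this section follow from adding the sexMale coefficient to each sex-by-marital-status interaction coefficient. The sketch below reproduces them from the regression table above; it assumes that never-married respondents are the reference category, and the variable names are illustrative.

```r
# Coefficients copied from the sex * marital.status regression table above.
b.male <- 7580  # sexMale: gender gap for the reference marital status

# sexMale:marital.status interaction terms.
# Assumption: "Never married" is the reference level, so its term is 0.
b.male.by.status <- c(
  "Never married" = 0,
  "Married"       = 13727,
  "Separated"     = -3736,
  "Divorced"      = 9161,
  "Widowed"       = 64308,
  "Invalid Skip"  = 23355
)

# Estimated male-minus-female income gap within each marital status.
gender.gap.by.status <- b.male + b.male.by.status
gender.gap.by.status
```

The widowed figure rests on very few respondents, so it should be read with the small-sample caveat noted above.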

Residual vs Fitted Plot

The residual vs. fitted plot continues to show that, for the lower fitted values, there is clustering around 0 and as the fitted values increase, deviation away from a residual of 0 increases, suggesting that this model is still overestimating income in lower income brackets. However, adding the interaction variable between sex and marital status does seem to correct much of the overestimation of income in the higher brackets.

Normal QQ Plot

The normal QQ plot shows a trend mostly similar to that of Models 2 and 3. As the residual vs. fitted plot suggests, there is a measurable correction for incomes in the upper tail of the range, but significant deviation remains.

Scale-Location Plot

As the fitted values increase, the variance still increases steadily, as in Models #2 and #3, but the increase is sharper at the bottom of the range and slightly flatter at the top.

Outliers & The Residuals vs Leverage Plot

The residuals vs. leverage plot shows that many of the outliers in Model #2 remain present, but the model corrects for some of the outliers toward the top of the income range that have standardized residuals below 0.

Conclusions from the Plot

Adding the sex*marital interaction term appears to make small but meaningful changes at the higher end of the income spectrum that removing top coding from income did not address. It did not solve some of the issues with the model at the lower end of the range, but the better fit at the top of the range suggests that the sex*marital interaction term improves the model in a meaningful way.

Predicted Income Gender Gap between Different Marital Statuses with and without the Interaction Variable

marital.status income.gap
Never-married 5804
Married 17652
Separated -4974
Divorced 10690
Widowed 31667
Non-Interview NaN
Invalid Skip 12499

We see from the predicted income gaps of the main effects model that, while the overall pattern across marital statuses remains the same, the model with the interaction variable captures a much larger income gap between widowed men and women and offers a clearer picture of the income gap between married men and women.

Considering the way in which the marital*sex interaction term improved the fit of the topcoded model (Model #2) at the top of the income range, the very high value of the widowed coefficient for men at 7,580 dollars suggests that a small number of high earners who have been widowed were potentially skewing the results of the model slightly.

The main effects model does, however, capture the positive effect that separation has on women's income and the negative effect it has on men's, which is one of the few negative coefficients for men's income we have seen in our analysis.

Testing the Significance of Adding an Interaction Variable on the Sex * Marital Status Term

To determine whether adding the sex * marital.status interaction term to the model had a statistically significant effect, we ran an ANOVA test comparing the two models.

## Analysis of Variance Table
## 
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017 + sex:marital.status
##   Res.Df           RSS Df   Sum of Sq      F       Pr(>F)    
## 1   2346 1595000619595                                       
## 2   2341 1567735263642  5 27265355953 8.1427 0.0000001246 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the p-value of roughly 0.0000001 is far below the threshold required to reject the null hypothesis. We therefore reject the null hypothesis that the income gap between men and women is the same across all marital statuses and conclude instead that marital status has a statistically significant effect on the income gap between the sexes.
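As a rough cross-check of the table above, the F statistic for a nested-model comparison can be recomputed by hand from the two residual sums of squares and their degrees of freedom. The sketch below is written in Python purely for illustration and is independent of the software that produced the table; the variable and function names are our own, and only the four numbers copied from the table come from the analysis.

```python
# Values copied from the ANOVA table above (main effects vs. sex:marital.status).
rss_main = 1595000619595         # RSS of the model without the interaction
rss_interaction = 1567735263642  # RSS of the model with sex:marital.status
df_main = 2346                   # residual degrees of freedom, main effects model
df_interaction = 2341            # residual degrees of freedom, interaction model


def nested_f_statistic(rss_small, df_small, rss_large, df_large):
    """F statistic for comparing two nested linear models.

    'small' is the model with fewer terms, 'large' the model with more.
    """
    extra_terms = df_small - df_large
    explained_per_term = (rss_small - rss_large) / extra_terms
    residual_variance = rss_large / df_large
    return explained_per_term / residual_variance


f_marital = nested_f_statistic(rss_main, df_main, rss_interaction, df_interaction)
print(f"F = {f_marital:.4f} on {df_main - df_interaction} and {df_interaction} df")
```

The result should agree with the F value of 8.1427 reported in the table. Converting it to a p-value requires the F distribution, which this sketch deliberately leaves to the statistical software.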

Model 7: Distilling the Impact of Education on Income Across Sexes

Is the effect of greater education statistically significant?

Education is the most obvious source of variation in income between individuals, both intuitively and in the tabular summaries and plots above. Our other models suggest that the difference in income grows as respondents receive more education, so we looked further into the highest degree earned variable to better understand the effect education has on income for each sex. The first question we asked was whether education is statistically significant and, if so, how significant.

## Analysis of Variance Table
## 
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + marital.status + age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017
##   Res.Df           RSS Df    Sum of Sq      F    Pr(>F)    
## 1   2355 1810482639239                                     
## 2   2346 1595000619595  9 215482019644 35.216 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When we run the ANOVA test, the p-value of 7.9394871 × 10^-59 is small enough for us to say that highest degree earned is a highly statistically significant predictor of income and is strongly influencing our model.

Adding an Interaction Term Between Sex and Highest Degree Earned

The tabular summaries and plots above indicate wide variation in the educational background of respondents. While it is clear that education significantly affects income, as the ANOVA test confirmed, the nuances of how specific educational backgrounds affect income are less clearly discernible from the data. To better understand this, we added an interaction variable for sex * highest.degree.earned to see how educational background interacts with sex in our model.

Estimate Std. Error t value Pr(>|t|)
(Intercept) -10041 14217 -0.71 0.4801
sexMale 19734 5851 3.37 0.0008
num.times.incarc -2811 1345 -2.09 0.0367
longest.length.incarc.months -133 92 -1.44 0.1499
condition.limiting.workYes -8955 2549 -3.51 0.0005
condition.limiting.workValid Skip -32121 26169 -1.23 0.2198
condition.limiting.workDon’t Know -43278 26206 -1.65 0.0988
condition.limiting.workRefused to Answer 16295 26351 0.62 0.5364
citizenshipUnknown, not born in the U.S. 5609 3189 1.76 0.0787
citizenshipUnknown, birthplace unknown 5968 2300 2.59 0.0095
urban.status.age.12Rural -2620 1280 -2.05 0.0408
urban.status.age.12Unknown -1574 5645 -0.28 0.7803
residential.dad.highest.grade.completed 144 231 0.62 0.5329
residential.mom.highest.grade.completed 501 248 2.02 0.0438
raceBlack -4317 1575 -2.74 0.0062
raceHispanic -49 1709 -0.03 0.9772
raceMixed Race (Non Hispanic) -9356 6408 -1.46 0.1444
num.children.under.6 -1126 759 -1.48 0.1381
highest.degree.earnedGED 10487 5736 1.83 0.0676
highest.degree.earnedHigh School Diploma 12772 4830 2.64 0.0082
highest.degree.earnedAssociate 23703 5422 4.37 0.0000
highest.degree.earnedBachelors 31045 4908 6.33 0.0000
highest.degree.earnedMasters 39375 5465 7.20 0.0000
highest.degree.earnedPhD 59078 10989 5.38 0.0000
highest.degree.earnedProfessional Degree (DDS, JD, MD) 70255 9551 7.36 0.0000
highest.degree.earnedNon-Interview 14833 7246 2.05 0.0408
highest.degree.earnedInvalid Skip 27112 10935 2.48 0.0132
marital.statusMarried 7760 1251 6.21 0.0000
marital.statusSeparated 1730 4617 0.37 0.7079
marital.statusDivorced 3981 2031 1.96 0.0500
marital.statusWidowed 3536 10767 0.33 0.7426
marital.statusInvalid Skip 2729 6851 0.40 0.6904
age.in.2017 531 389 1.36 0.1725
sexMale:highest.degree.earnedGED -8668 7207 -1.20 0.2292
sexMale:highest.degree.earnedHigh School Diploma -2116 6083 -0.35 0.7280
sexMale:highest.degree.earnedAssociate -4614 6995 -0.66 0.5096
sexMale:highest.degree.earnedBachelors -3881 6191 -0.63 0.5308
sexMale:highest.degree.earnedMasters -7640 7139 -1.07 0.2846
sexMale:highest.degree.earnedPhD 4977 18958 0.26 0.7929
sexMale:highest.degree.earnedProfessional Degree (DDS, JD, MD) -8082 14703 -0.55 0.5826
sexMale:highest.degree.earnedNon-Interview 594 7570 0.08 0.9375
sexMale:highest.degree.earnedInvalid Skip -7115 14762 -0.48 0.6299

Reviewing the Income Gender Gap by Educational Attainment

Adding the sex * highest.degree.earned interaction shows that, while the income gap by sex persists at every degree level, education can shrink it considerably. The exception is the PhD, where the predicted income gap between a male PhD and a female PhD is 24,711 dollars, although this result is not statistically significant.

By contrast, the income gap based on sex decreases notably for master's degrees, to 12,094 dollars, and for professional degrees, to 11,652 dollars.

The data also suggest that racial bias runs deep. Although adding the sex * highest.degree.earned interaction shrinks the Hispanic coefficient to just -49 dollars and removes its statistical significance, Black men and women can still expect to make 4,317 dollars less than their Non-Black/Non-Hispanic counterparts, and that result remains statistically significant (p = 0.0062).
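The gaps quoted above follow from the coefficient table by simple addition: the predicted male-female gap at a given degree is the sexMale main effect plus that degree's sexMale:highest.degree.earned term. The short Python sketch below reproduces that arithmetic. It is an illustration only, the dictionary and variable names are our own, and it assumes that "No Degree Earned" is the reference level because it has no coefficient of its own in the table.

```python
# Coefficients copied from the Model 7 table above (dollars).
sex_male = 19734  # male-female gap at the reference level (assumed: no degree earned)
interaction = {   # sexMale:highest.degree.earned adjustments
    "GED": -8668,
    "High School Diploma": -2116,
    "Associate": -4614,
    "Bachelors": -3881,
    "Masters": -7640,
    "PhD": 4977,
    "Professional Degree (DDS, JD, MD)": -8082,
}

# Gap for a degree = main effect of sex + that degree's interaction term.
predicted_gap = {"No Degree Earned": sex_male}
for degree, adjustment in interaction.items():
    predicted_gap[degree] = sex_male + adjustment

for degree, gap in predicted_gap.items():
    print(f"{degree:<36}{gap:>8,}")
```

For example, the master's gap is 19,734 - 7,640 = 12,094 dollars and the PhD gap is 19,734 + 4,977 = 24,711 dollars, matching the figures in the text.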
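One way to read the two race coefficients side by side is to turn each estimate and standard error into a t value and an approximate 95% interval. The Python sketch below does this for illustration only. The names are our own, and the multiplier of 1.96 is an assumption: it is the normal approximation, which should be close to the exact t quantile given the roughly 2,337 residual degrees of freedom in this model.

```python
# Estimates and standard errors copied from the Model 7 table above (dollars).
race_terms = {
    "raceBlack": (-4317, 1575),
    "raceHispanic": (-49, 1709),
}

Z_95 = 1.96  # assumption: normal approximation to the 97.5% t quantile

intervals = {}
for term, (estimate, std_error) in race_terms.items():
    t_value = estimate / std_error
    lower = estimate - Z_95 * std_error
    upper = estimate + Z_95 * std_error
    intervals[term] = (lower, upper)
    print(f"{term:<14} t = {t_value:6.2f}   approx. 95% CI [{lower:,.0f}, {upper:,.0f}]")
```

Under this approximation the interval for raceBlack lies entirely below zero, while the interval for raceHispanic comfortably includes zero, which is consistent with the p-values in the table.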

Residual vs Fitted Plot

The residual vs. fitted plot shows clustering around 0 for the lower fitted values and increasing deviation from a residual of 0 as the fitted values increase. This is very similar to Model #2, but the deviation appears to be slightly more pronounced.

Normal QQ Plot

The normal QQ plot is almost identical to that of Model #2, suggesting that education already has a great degree of influence on the model and is accounted for in Model #2. It is difficult to find any noticeable variation in either the upper or the lower tail of the range.

Scale-Location Plot

As the fitted values increase, the variance still increases steadily, much as in Models #2 and #3, but the trends seem to have shifted slightly rightward and differ most from Models 2 and 3 at the upper end of the income range.

Outliers & The Residuals vs Leverage Plot

The residuals vs. leverage plot shows that many of the outliers that were clustered around 0 in Model #2 now have a greater spread, suggesting more leverage than in the initial model.

Conclusions from the Plot

Adding the sex*education interaction term appears to make very slight changes toward the lower and upper ends of the range, with some outliers having more leverage in the model with the interaction term than without. However, the addition does little to correct existing issues with the model and appears to be largely redundant.

Predicted Income Gender Gap between Different Educational Attainment Levels with and without the Interaction Variable

Looking at the predicted income gaps by education from the main effects model, we see that the values are quite close to those from our interaction model, with the exception of the PhD value. The interaction term between sex * highest.degree.earned may therefore be redundant.

highest.degree.earned income.gap
No Degree Earned 18156
GED 10820
High School Diploma 15988
Associate 11865
Bachelors 15887
Masters 13362
PhD 10750
Professional Degree (DDS, JD, MD) 17290
Non-Interview 11406
Invalid Skip 3601

The main effects model's estimates of the income gaps between men and women of different educational backgrounds are largely accurate, in that they are similar to the findings from our model with the sex*education interaction variable. Education tends to cut the income gap slightly. However, the main effects model underestimates the effect of a master's degree in reducing income inequality, estimating a gap of 13,362 dollars where the interaction model shows 12,094 dollars. It similarly underestimates the effect of a professional degree, estimating a gap of 17,290 dollars where the interaction model shows 11,652 dollars.
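To make the size of these disagreements concrete, the two sets of predicted gaps can be subtracted degree by degree. The Python sketch below is an illustration only, with names of our own choosing. The main effects values come from the table above, and the interaction model values are the ones quoted in the text for the master's, PhD, and professional degrees.

```python
# Predicted male-female income gaps in dollars, copied from the text and table above.
main_effects_gap = {
    "Masters": 13362,
    "PhD": 10750,
    "Professional Degree (DDS, JD, MD)": 17290,
}
interaction_gap = {
    "Masters": 12094,
    "PhD": 24711,
    "Professional Degree (DDS, JD, MD)": 11652,
}

# Positive difference: the main effects model predicts a larger gap
# than the interaction model for that degree.
difference = {
    degree: main_effects_gap[degree] - interaction_gap[degree]
    for degree in main_effects_gap
}

for degree, diff in difference.items():
    print(f"{degree:<36}{diff:>+9,}")
```

The master's estimates differ by 1,268 dollars and the professional degree estimates by 5,638 dollars, while for PhDs the sign flips: the main effects model predicts a gap 13,961 dollars smaller than the interaction model does.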

Testing the Significance of Adding an Interaction Variable on the Sex * Highest Degree Earned Status Term

To test whether a sex * highest.degree.earned interaction variable is redundant we tested the new model against our original main effects model with an ANOVA test.

## Analysis of Variance Table
## 
## Model 1: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017
## Model 2: income ~ sex + num.times.incarc + longest.length.incarc.months + 
##     condition.limiting.work + citizenship + urban.status.age.12 + 
##     residential.dad.highest.grade.completed + residential.mom.highest.grade.completed + 
##     race + num.children.under.6 + highest.degree.earned + marital.status + 
##     age.in.2017 + sex:highest.degree.earned
##   Res.Df           RSS Df  Sum of Sq      F Pr(>F)
## 1   2346 1595000619595                            
## 2   2337 1591771751744  9 3228867851 0.5267 0.8561

We see that the p-value of 0.8561254 is nowhere near the threshold required to reject the null hypothesis. Therefore, we fail to reject the null hypothesis that the income gap between men and women is the same across levels of educational attainment.
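Another way to compare the three ANOVA tests in this part of the analysis is to ask how much of the smaller model's residual sum of squares each added term removes. The Python sketch below computes that share for illustration. It is not part of the original analysis, the names are our own, and the RSS values are copied from the three ANOVA tables above.

```python
# (RSS without the term, RSS with the term), copied from the ANOVA tables above.
comparisons = {
    "highest.degree.earned": (1810482639239, 1595000619595),
    "sex:marital.status": (1595000619595, 1567735263642),
    "sex:highest.degree.earned": (1595000619595, 1591771751744),
}

# Share of the smaller model's residual sum of squares removed by the added term.
rss_reduction = {
    term: (rss_without - rss_with) / rss_without
    for term, (rss_without, rss_with) in comparisons.items()
}

for term, share in rss_reduction.items():
    print(f"{term:<28}{share:>8.2%}")
```

The education main effect removes about 11.9% of the residual sum of squares and the sex:marital.status interaction about 1.7%, while the sex:highest.degree.earned interaction removes only about 0.2%, in line with the ANOVA result that this last term adds little to the model.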

Discussion

Main Conclusions

The variables for which there appeared to be a statistically significant impact on income (according to the second model, in which the values were not topcoded) were:

  1. sex

  2. num.times.incarc

  3. residential.mom.highest.grade.completed

  4. race, if the respondent is Black

  5. highest.degree.earned (education)

  6. marital.status, if the respondent is married.

When running our first linear regression model on the dataset with topcoded income variables, there were some significant issues with the fit of the model to the data. Removing the topcoded values was helpful in rectifying many of those problems. Despite the noticeable improvements, however, issues remained with residuals toward the top and bottom of the income range not fitting the model.

To address those issues, we looked at some of the most statistically significant factors affecting income and reviewed the data for each. After an initial review, we saw significant effects from num.times.incarc, longest.length.incarc.months, race, marital.status, and education on income discrepancies based on sex. We ran regressions on each of these variables and created interaction terms with sex to see the effect.

Models 3 & 4: Number of Times Incarcerated and Length of Incarceration

From the results of the regression, it appears that, despite more men having a criminal history and being incarcerated longer than women, this did not result in a statistically significant difference in income between men and women in the sample of the dataset that was examined.

Model 5: Race

Across all models except Model #1, a pattern of statistically significant lower incomes persisted for Black respondents. This pattern persisted even when interacting race and sex. It was not true for other racial groups, suggesting that Black respondents face patterns of discrimination that affect their income even when controlling for other factors.

Model 6: Marital Status

The results indicate that marital status does produce a statistically significant difference in earnings between men and women and that the difference is not uniform across marital statuses: widowed women see a steeper drop in earnings than widowed men, and separated women see a boost in income where separated men see a steep drop.

The addition of the marital.status*sex term corrected for some of the deviations from the model at the top of the income range and the combination of the removal of top coding from the income variable and addition of the interaction term marital.status*sex produced our most accurate model.

However, the small sample size of widowed and separated participants likely affects the reliability of these results.

Model 7: Education

We additionally see that education is the most effective way to increase income for both men and women. As educational attainment rises, the predicted income gap on the basis of sex generally shrinks, although the ANOVA test found that the sex * highest.degree.earned interaction was not statistically significant (p = 0.856). The pattern also breaks down at the top of the range, where the predicted income gap between men and women is greatest between male and female PhDs.

Other Potential Confounders

The geographical region in which respondents lived could serve as a confounder in the models. Most of the participants selected for this study appear to have lived in an urban area at the age of 12, which could have influenced the kinds of resources they had access to. It also would have been helpful to have information on the types of crimes that respondents had committed (e.g. misdemeanors, white collar crimes, felonies) to get a better understanding of the influence of criminal history and gender on income. Individuals with felonies face greater barriers to entry in the job market than those with misdemeanors, which could explain some of the outlier behavior we observed among individuals who had substantial incarceration histories but high incomes.

Another challenge with this dataset is the time dimension. Most variables are collected at a specific point in time. For example, the urban vs. rural variable is specific to where the respondent lived when they were 12. It is possible that residence type has a stronger impact later in life, but this information wasn’t collected.

Another variable worth discussing is citizenship. The definition of citizenship in the codebook is tied to birthplace. However, respondents could be born outside of the U.S. and still be U.S. citizens. This definition limits what our model findings can say about citizenship status.

Variables regarding the presence of children in the household offered only limited context on how children might affect income and did not account for adoption or for children with special needs. Additionally, because the information covers only children in the household under the age of 6, it limits our understanding of the effect childcare has on income.

Model Soundness

Some of our results had clear trends that can be expressed with a reasonable degree of confidence. Results showing discrepancies in income between men and women across racial background, educational attainment, and incarceration status were persistent. There was also a persistent trend showing discrepancies in income on the basis of racial/ethnic background. While these results appear for all groups other than Non-Black/Non-Hispanic respondents, they were most statistically significant for Black respondents. This result was consistent. The sample size of Black respondents was sufficient to make a strong argument that racial discrimination against Black respondents has a statistically significant effect on income.

In many cases where significant results were skewing the original model, they came from groups with small sample sizes. Findings for groups such as widows, Mixed Race (Non-Hispanic) persons, people with lengthy incarcerations, and men and women with multiple children require further analysis with a larger sample before they can be asserted with confidence.

Our models do a reasonably good job of accommodating these questions around sample size, but limitations exist in some areas where troubling trends exist and need to be further studied.

Despite some of the limitations of the dataset, the findings of our analysis could start a conversation about the stark differences in income according to gender. These conversations could allow policymakers to begin to think about potential avenues for intervention that could curb the effect of this trend.